3¶

Question:

Perform clustering with two different algorithms on one dataset. Choose a public dataset that has more than 10 features. Carry out Exploratory Data Analysis (EDA), feature engineering, and any pre-processing you consider necessary. Explain the results of each task. (LO1, LO3, LO4, 17 points)

Answer

In [ ]:
import pandas as pd
pd.set_option('future.no_silent_downcasting',True)

dt = pd.read_csv('./data.csv')
In [ ]:
# Describe data shape
print("Data Shape")
print(dt.shape)
print("--------------")

# Describe overall data
print("Data Info")
print(dt.info(memory_usage=False))
print("--------------")

print("Data Description")
print(dt.describe())
print("--------------")
Data Shape
(207, 41)
--------------
Data Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207 entries, 0 to 206
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             207 non-null    int64  
 1   ID                     207 non-null    int64  
 2   Country of Origin      207 non-null    object 
 3   Farm Name              205 non-null    object 
 4   Lot Number             206 non-null    object 
 5   Mill                   204 non-null    object 
 6   ICO Number             75 non-null     object 
 7   Company                207 non-null    object 
 8   Altitude               206 non-null    object 
 9   Region                 205 non-null    object 
 10  Producer               206 non-null    object 
 11  Number of Bags         207 non-null    int64  
 12  Bag Weight             207 non-null    object 
 13  In-Country Partner     207 non-null    object 
 14  Harvest Year           207 non-null    object 
 15  Grading Date           207 non-null    object 
 16  Owner                  207 non-null    object 
 17  Variety                201 non-null    object 
 18  Status                 207 non-null    object 
 19  Processing Method      202 non-null    object 
 20  Aroma                  207 non-null    float64
 21  Flavor                 207 non-null    float64
 22  Aftertaste             207 non-null    float64
 23  Acidity                207 non-null    float64
 24  Body                   207 non-null    float64
 25  Balance                207 non-null    float64
 26  Uniformity             207 non-null    float64
 27  Clean Cup              207 non-null    float64
 28  Sweetness              207 non-null    float64
 29  Overall                207 non-null    float64
 30  Defects                207 non-null    float64
 31  Total Cup Points       207 non-null    float64
 32  Moisture Percentage    207 non-null    float64
 33  Category One Defects   207 non-null    int64  
 34  Quakers                207 non-null    int64  
 35  Color                  207 non-null    object 
 36  Category Two Defects   207 non-null    int64  
 37  Expiration             207 non-null    object 
 38  Certification Body     207 non-null    object 
 39  Certification Address  207 non-null    object 
 40  Certification Contact  207 non-null    object 
dtypes: float64(13), int64(6), object(22)None
--------------
Data Description
       Unnamed: 0          ID  Number of Bags       Aroma      Flavor  \
count  207.000000  207.000000      207.000000  207.000000  207.000000   
mean   103.000000  103.000000      155.449275    7.721063    7.744734   
std     59.899917   59.899917      244.484868    0.287626    0.279613   
min      0.000000    0.000000        1.000000    6.500000    6.750000   
25%     51.500000   51.500000        1.000000    7.580000    7.580000   
50%    103.000000  103.000000       14.000000    7.670000    7.750000   
75%    154.500000  154.500000      275.000000    7.920000    7.920000   
max    206.000000  206.000000     2240.000000    8.580000    8.500000   

       Aftertaste    Acidity        Body     Balance  Uniformity  Clean Cup  \
count  207.000000  207.00000  207.000000  207.000000  207.000000      207.0   
mean     7.599758    7.69029    7.640918    7.644058    9.990338       10.0   
std      0.275911    0.25951    0.233499    0.256299    0.103306        0.0   
min      6.670000    6.83000    6.830000    6.670000    8.670000       10.0   
25%      7.420000    7.50000    7.500000    7.500000   10.000000       10.0   
50%      7.580000    7.67000    7.670000    7.670000   10.000000       10.0   
75%      7.750000    7.87500    7.750000    7.790000   10.000000       10.0   
max      8.420000    8.58000    8.250000    8.420000   10.000000       10.0   

       Sweetness     Overall  Defects  Total Cup Points  Moisture Percentage  \
count      207.0  207.000000    207.0        207.000000           207.000000   
mean        10.0    7.676812      0.0         83.706570            10.735266   
std          0.0    0.306359      0.0          1.730417             1.247468   
min         10.0    6.670000      0.0         78.000000             0.000000   
25%         10.0    7.500000      0.0         82.580000            10.100000   
50%         10.0    7.670000      0.0         83.750000            10.800000   
75%         10.0    7.920000      0.0         84.830000            11.500000   
max         10.0    8.580000      0.0         89.330000            13.500000   

       Category One Defects     Quakers  Category Two Defects  
count            207.000000  207.000000            207.000000  
mean               0.135266    0.690821              2.251208  
std                0.592070    1.686918              2.950183  
min                0.000000    0.000000              0.000000  
25%                0.000000    0.000000              0.000000  
50%                0.000000    0.000000              1.000000  
75%                0.000000    1.000000              3.000000  
max                5.000000   12.000000             16.000000  
--------------
In [ ]:
from ydata_profiling import ProfileReport

ProfileReport(dt, title="Profiling Report")

Data preprocessing¶

Column                Type  Missing  Imbalanced  Zeros  Description
Variety               Cat   X        -           -
Processing Method     Cat   X        X           -      Add NA as a new value
Aroma                 Real  -        -           -
Flavor                Real  -        -           -
Aftertaste            Real  -        -           -
Acidity               Real  -        -           -
Body                  Real  -        -           -
Balance               Real  -        -           -
Moisture Percentage   Real  -        -           -
Category One Defects  Real  -        -           X
Category Two Defects  Real  -        -           X
Color                 Cat   -        -           -
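
The Missing, Imbalanced and Zeros flags in the table can be checked directly with pandas. The following is a minimal sketch on a hypothetical toy frame (the values and the names `toy`, `summary` and `top_share` are made up for illustration and are not the coffee data), showing one possible way to count missing values and zeros per column and to measure how dominant the most frequent category is.

```python
import pandas as pd

# Hypothetical toy frame standing in for a few columns of the dataset
toy = pd.DataFrame({
    "Processing Method": ["Washed / Wet", None, "Natural / Dry", "Washed / Wet"],
    "Aroma": [7.5, 7.8, 8.0, 7.6],
    "Category One Defects": [0, 0, 2, 0],
})

# Missing values per column, and zeros per numeric column
# (non-numeric columns get NaN in the "zeros" column)
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "zeros": toy.select_dtypes("number").eq(0).sum(),
})
print(summary)

# Share of the most frequent category among non-missing values;
# a share close to 1 suggests an imbalanced column
top_share = toy["Processing Method"].value_counts(normalize=True).iloc[0]
print(top_share)
```

What share counts as "imbalanced" is a judgment call; the profiling report applies its own threshold.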
In [ ]:
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

# Keep only the columns used for clustering; copy so dt stays untouched
df = dt[['Variety', 'Processing Method','Aroma', 'Flavor', 'Aftertaste', 'Acidity', 'Body', 'Balance','Overall', 'Moisture Percentage', 'Category One Defects','Category Two Defects', 'Color']].copy()

# Fill missing values: a missing Processing Method becomes its own category.
# fillna returns a new DataFrame, so the result must be assigned back.
df = df.fillna(value={"Processing Method": "other"})

# Encode categorical values as integer labels
label_encoder = LabelEncoder()
for col in ['Variety', 'Processing Method', 'Color']:
    df[col] = label_encoder.fit_transform(df[col])

# Normalisation using the min-max scaler (each column rescaled to [0, 1])
mms = MinMaxScaler()
for i in df.columns.to_list():
    df[i] = mms.fit_transform(df[[i]]).ravel()
In [ ]:
df
Out[ ]:
Variety Processing Method Aroma Flavor Aftertaste Acidity Body Balance Overall Moisture Percentage Category One Defects Category Two Defects Color
0 0.083333 0.1 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.874074 0.0 0.1875 0.363636
1 0.395833 0.8 0.961538 1.000000 0.714286 0.668571 0.767606 0.902857 0.958115 0.777778 0.0 0.0000 0.000000
2 0.416667 0.7 0.879808 0.954286 0.805714 0.765714 0.767606 0.857143 0.869110 0.770370 0.0 0.1250 1.000000
3 0.395833 0.8 0.759615 0.811429 0.857143 0.811429 0.943662 0.805714 0.827225 0.874074 0.0 0.0000 0.363636
4 0.604167 0.3 0.879808 0.902857 0.805714 0.811429 0.767606 0.714286 0.827225 0.859259 0.0 0.1250 0.909091
... ... ... ... ... ... ... ... ... ... ... ... ... ...
202 0.520833 0.4 0.322115 0.240000 0.142857 0.194286 0.415493 0.285714 0.214660 0.844444 0.0 0.2500 0.363636
203 0.645833 0.4 0.399038 0.188571 0.045714 0.194286 0.415493 0.285714 0.214660 0.770370 0.0 0.7500 0.363636
204 0.166667 0.8 0.360577 0.240000 0.234286 0.097143 0.176056 0.234286 0.172775 0.859259 0.0 0.6875 0.363636
205 0.500000 0.4 0.000000 0.000000 0.045714 0.194286 0.176056 0.188571 0.083770 0.814815 0.0 0.8125 0.090909
206 0.520833 0.6 0.360577 0.188571 0.000000 0.000000 0.000000 0.000000 0.000000 0.837037 0.0 0.0625 0.363636

207 rows × 13 columns
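
To make the scaling step concrete, here is a small worked example of the min-max formula x' = (x - min) / (max - min). It is only a sketch: it uses the Aroma minimum, median and maximum from the `describe()` output above (6.50, 7.67, 8.58) rather than the full column, and compares a manual calculation with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Aroma min, median and max taken from the describe() output
aroma = np.array([[6.50], [7.67], [8.58]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (aroma - aroma.min()) / (aroma.max() - aroma.min())

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(aroma)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

The minimum maps to 0 and the maximum to 1, which is why the best-scoring and worst-scoring rows in the table show values of 1.000000 and 0.000000.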

Data processing¶

In [ ]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

%matplotlib inline

# Find a suitable k using the elbow method (WCSS) and the silhouette score
wcss = []
silhouettes = []
max_k = 15
for k in range(2, max_k):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10)
    cluster_pred = kmeans.fit_predict(df)  # fit once and reuse the labels
    silhouettes.append(silhouette_score(df, cluster_pred))  # mean silhouette
    wcss.append(kmeans.inertia_)  # within-cluster sum of squares

# Plot the elbow curve
plt.plot(range(2, max_k), wcss)
plt.title('Elbow Value')
plt.xlabel('K value')
plt.ylabel('WCSS')
plt.show()

# Plot the silhouette scores
plt.plot(range(2, max_k), silhouettes)
plt.title('Silhouette Value')
plt.xlabel('K value')
plt.ylabel('Silhouette')
plt.show()

# Fit the final model with the chosen k (not the last loop value)
final_k = 3
kmeans = KMeans(n_clusters=final_k, init='k-means++', max_iter=300, n_init=10)
cluster_labels = kmeans.fit_predict(df)
[Figure: Elbow Value, WCSS against K value]
[Figure: Silhouette Value, silhouette score against K value]
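
Besides reading the curves by eye, the silhouette scores can also be used to pick k programmatically. The sketch below is an illustration on synthetic data, not the notebook's own selection procedure: the three blob centers, `random_state=0` and the names `X`, `scores` and `best_k` are all assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs (illustration only)
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Mean silhouette score for each candidate k
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest mean silhouette score
best_k = max(scores, key=scores.get)
print(best_k)
```

On real data the maximum is often less clear-cut, so it is worth checking the result against the elbow curve rather than relying on a single metric.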
In [ ]:
# Visualizing result using PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
pca_result = pca.fit_transform(df)

features = ['Aroma', 'Flavor', 'Aftertaste', 'Acidity', 'Body', 'Balance', 'Overall', 'Moisture Percentage', 'Category One Defects', 'Category Two Defects', 'Color']
df['Cluster'] = cluster_labels
In [ ]:
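for feature in features:
    plt.figure(figsize=(8, 6))

    # Scatter plot of the first two principal components, colored by cluster.
    # Note: the scatter itself is the same for every feature; only the title changes.
    scatter = plt.scatter(pca_result[:, 0], pca_result[:, 1], c=df['Cluster'], cmap='viridis')

    # Adding labels
    plt.title(f'PCA Visualization for {feature} with Cluster Labels')
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')

    # Adding legend (one entry per cluster)
    plt.legend(*scatter.legend_elements(), title='Cluster')
    plt.show()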
[Figures: 11 PCA scatter plots, one per feature, Principal Component 1 against Principal Component 2, colored by cluster label]
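
Since the plots show only the first two principal components, it helps to know how much of the total variance those two components carry. The following sketch uses a random matrix as a hypothetical stand-in for the scaled feature matrix (207 rows, 13 columns; the names `X`, `coords` and `ratio` are assumptions for illustration) and reads `explained_variance_ratio_` from a fitted `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the scaled feature matrix: 207 rows, 13 columns
rng = np.random.default_rng(0)
X = rng.random((207, 13))

# Project onto the first two principal components
pca = PCA(n_components=2)
coords = pca.fit_transform(X)

# Fraction of total variance captured by each of the two components
ratio = pca.explained_variance_ratio_
print(coords.shape)
print(ratio, ratio.sum())
```

If the two components together explain only a small share of the variance, clusters that overlap in the 2D plot may still be well separated in the full feature space.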